Uploading a CSV file to HF

I have a CSV file that contains two columns:

  1. Column-1: Titles (Image description/caption)
  2. Column-2: Images (Image URL)

Assume that each row contains only one image URL and its corresponding description (in my actual data a row may contain multiple image URLs, which I will have to clean manually).

Now, using this dataset, I want to fine-tune the Salesforce BLIP model (base or BLIP-2). To do that, I have found a Google Colab notebook that uses a football dataset for training.

There I can see code along the lines of load_dataset("repository/data", split="train"). Now, I have some queries regarding this.

  1. When uploading my data to an HF repo, should the data contain image URLs, or must it contain the raw images?
  2. For the split="train" part, should I divide my complete data into train.csv and test.csv and then upload them to HF? Or is there more code to write, or does the load_dataset function automatically divide the dataset into a train-test split when split="train" is passed?

I am unable to find proper documentation for this process, and it would be very helpful if someone could help me resolve this issue.

Google Colab Notebook for Reference


I think it’s better to download the data first, create a dataset that includes the actual images, and then upload it, as this reduces the risk of encountering download errors during training. However, either method should work. Additionally, there is an option to create a script for loading the dataset.
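For example, here is a minimal sketch of the "prepare the images in advance" approach. The file name my_captions.csv, the column names Titles/Images, and the repo id your-username/blip-captions are all placeholders for your own values:

```python
# Sketch: read the CSV, download each image locally, attach the images to the
# dataset, create the splits yourself, and push everything to the Hub.
import os
import requests
from datasets import load_dataset, Image

os.makedirs("images", exist_ok=True)

# 1. Load the raw CSV (two columns: caption text and image URL).
ds = load_dataset("csv", data_files="my_captions.csv", split="train")

# 2. Download each image to a local file and record its path.
def download_image(example, idx):
    path = f"images/{idx}.jpg"
    with open(path, "wb") as f:
        f.write(requests.get(example["Images"], timeout=30).content)
    example["image"] = path
    return example

ds = ds.map(download_image, with_indices=True)

# 3. Treat the path column as actual image data so the pixels are stored
#    in the dataset rather than just the URL strings.
ds = ds.cast_column("image", Image())

# 4. load_dataset(..., split="train") only selects an existing split; it does
#    not create one, so make the train/test split here.
ds = ds.train_test_split(test_size=0.1)

# 5. Upload both splits; later you can call
#    load_dataset("your-username/blip-captions", split="train").
ds.push_to_hub("your-username/blip-captions")
```

After this, no download from the original URLs happens during training, and split="train" simply picks the training split you uploaded.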

Ultimately, you need to decide whether to have the Trainer’s DataCollator download the data from the URL or to prepare the dataset in advance and use the datasets library.
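If you keep only URLs in the dataset, a rough sketch of the collate-at-training-time alternative looks like this. The column names (Titles, Images) are again assumptions; the processor is the standard BLIP processor from transformers:

```python
# Sketch: keep only URLs in the dataset and fetch the images per batch.
# A failed download here interrupts training, which is why preparing the
# images in advance is usually the safer option.
import io
import requests
from PIL import Image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def collate_fn(batch):
    images = [
        Image.open(io.BytesIO(requests.get(row["Images"], timeout=30).content)).convert("RGB")
        for row in batch
    ]
    captions = [row["Titles"] for row in batch]
    return processor(images=images, text=captions, padding=True, return_tensors="pt")
```

You would then pass this function as the data_collator to the Trainer, or as collate_fn to a plain DataLoader.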
